Fill in your name, student id number and email address¶

name: Peppi-Lotta Saari¶

student id: 517334¶

email: plsaar@utu.fi¶

Data analysis and knowledge discovery - Exercise 3: Unsupervised learning¶

This is the template for the third exercise. The purpose of this exercise is to familiarize yourself with the basics of unsupervised learning by using the agglomerative hierarchical clustering and k-means clustering algorithms to find patterns.

The data set utilised in this exercise is a simplified and downsampled version of a knowledge discovery and data mining competition data set. The data will be available on the course's Moodle page. For those who are interested, the original data can be found at https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html. However, please make sure to use the version on Moodle as ex3_network_data.csv. The data is described below.

The data set contains samples of network activity simulated in a military network environment. There are different types of malicious activity, and also activity that is considered normal. It is not necessary to understand the details of the data set in order to complete the exercise.

In addition to normal activity, there are 3 types of malicious activity - denial of service, unauthorized remote login, and network probing (e.g. port scanning) - simulated in a military network environment. There are 500 samples of each class. There are 6 numeric features, described below:

src_bytes: number of bytes from source to destination
dst_bytes: number of bytes from destination to source
duration: length of connection (seconds)
count: number of connections to the same host as the current connection in the past two seconds
serror_rate: percentage of connections that have SYN errors
rerror_rate: percentage of connections that have REJ errors

In real applications, visualizing and cleaning the data are important steps. However, in this exercise you can treat the data as given, and focus on the unsupervised methods.

Please consider the following things when returning your notebook:

  • As in the two previous exercises, the grading scale is failed/passed/passed with honors.

  • For a passing grade each part of the exercise, except for the BONUS, must be completed, and all questions should be answered. Some mistakes are allowed as long as you clearly attempt to solve all the exercises.

  • For doing both the exercise and the optional bonus task sufficiently well, you will be awarded one bonus point for the exam.

  • All the cells in the finished notebook should run without crashing. Please delete unnecessary cells. As a good rule of thumb, use "Restart and run all" on the finished notebook to make sure it runs without errors and produces the expected output.

  • Remember to comment your code to explain how it works and what you intend for it to do.

  • Answer the questions asked in the assignments in Markdown cells.

  • If you are having problems with this exercise, try an online search first, but don't just copy-paste any code you find. See exercise guidelines in the Moodle page of this course. If you can't find a solution to your problem, ask for advice in the course discussion forum on Moodle or contact oskari.s.heikkinen@utu.fi.

  • If/when you look things up during this exercise, please cite your sources (e.g. a link to a web page). It's better to cite too much than too little.

Library imports, Jupyter Notebook settings etc.¶

The below libraries are sufficient to complete the exercise. You can import additional functionality here if you want.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score
from scipy.cluster.hierarchy import linkage, dendrogram

%matplotlib inline

Read the data¶

  • Download the exercise 3 data on the Moodle page of this course.
  • Read the data into a Pandas dataframe.
  • Display a few rows and some basic information to make sure the data was loaded correctly.
In [2]:
# Path for the data
data_path = 'ex3_network_data.csv'

# Create a dataframe
network_data = pd.read_csv(data_path)

network_data.head()
Out[2]:
src_bytes dst_bytes duration count serror_rate rerror_rate class
0 0 0 0 223 1.0 0.0 denial_of_service
1 1032 0 0 511 0.0 0.0 denial_of_service
2 520 0 0 511 0.0 0.0 denial_of_service
3 1032 0 0 510 0.0 0.0 denial_of_service
4 520 0 0 448 0.0 0.0 denial_of_service
In [3]:
network_data.dtypes
Out[3]:
src_bytes        int64
dst_bytes        int64
duration         int64
count            int64
serror_rate    float64
rerror_rate    float64
class           object
dtype: object
In [4]:
network_data.describe()
Out[4]:
src_bytes dst_bytes duration count serror_rate rerror_rate
count 2000.000000 2000.00000 2000.000000 2000.000000 2000.000000 2000.00000
mean 402.324500 382.56600 20.246500 111.158000 0.059415 0.14014
std 464.815917 861.81793 241.867267 198.399713 0.234554 0.34661
min 0.000000 0.00000 0.000000 1.000000 0.000000 0.00000
25% 8.000000 0.00000 0.000000 1.000000 0.000000 0.00000
50% 286.500000 0.00000 0.000000 2.000000 0.000000 0.00000
75% 748.000000 146.00000 0.000000 105.250000 0.000000 0.00000
max 4703.000000 4982.00000 4776.000000 511.000000 1.000000 1.00000
In [5]:
network_data['class'].value_counts()
Out[5]:
denial_of_service            500
normal                       500
probe                        500
unauthorized_remote_login    500
Name: class, dtype: int64

Part 1: Preprocess and visualize the data¶

  • Perform z-score standardization on the features to ensure that all features have the same scale.

  • Project the data to two dimensions by using principal component analysis (PCA) and visualize the resulting two-dimensional data in a scatter plot. Don't color the scatter plot yet.

  • Does it look like there are clear clusters? Don't worry if they're hard to see.

In [6]:
# Standardization
# I copied this from my own exercise 2 answers

#initialize 
network_numeric_features = network_data.select_dtypes(include='number')
numeric_features = list(network_numeric_features.columns)

#scale
scaler = StandardScaler().fit(network_numeric_features)
network_numeric_features_scaled = scaler.transform(network_numeric_features)

# make dataframe again
network_numeric_features_scaled = pd.DataFrame(network_numeric_features_scaled)

# insert column names back
network_numeric_features_scaled.columns = numeric_features

# replace the numeric values in a copy of the data with the standardized ones
network_data_copy = network_data.copy()
network_data_copy[numeric_features] = network_numeric_features_scaled
network_data_scaled = network_data_copy

# display
display(network_data_scaled.head()) #all data
display(network_numeric_features_scaled.head()) #just numeric values
src_bytes dst_bytes duration count serror_rate rerror_rate class
0 -0.865773 -0.444017 -0.08373 0.563862 4.011105 -0.404417 denial_of_service
1 1.355016 -0.444017 -0.08373 2.015840 -0.253374 -0.404417 denial_of_service
2 0.253229 -0.444017 -0.08373 2.015840 -0.253374 -0.404417 denial_of_service
3 1.355016 -0.444017 -0.08373 2.010798 -0.253374 -0.404417 denial_of_service
4 0.253229 -0.444017 -0.08373 1.698219 -0.253374 -0.404417 denial_of_service
src_bytes dst_bytes duration count serror_rate rerror_rate
0 -0.865773 -0.444017 -0.08373 0.563862 4.011105 -0.404417
1 1.355016 -0.444017 -0.08373 2.015840 -0.253374 -0.404417
2 0.253229 -0.444017 -0.08373 2.015840 -0.253374 -0.404417
3 1.355016 -0.444017 -0.08373 2.010798 -0.253374 -0.404417
4 0.253229 -0.444017 -0.08373 1.698219 -0.253374 -0.404417
In [7]:
# This is done based on the TKO_7093 Statistical Data Analysis course's PCA example code

# quantitative features (standardized)
network_numeric_features_scaled

# create PCA model
pca = PCA().fit(network_numeric_features_scaled)

# transform data into new space
network_numeric_pca = pd.DataFrame(pca.transform(network_numeric_features_scaled))

# visualise the first two PCA components
network_numeric_pca.plot.scatter(0, 1, figsize=(12,10))

# add transformed data back to original data frame
network_data_scaled_extended = pd.concat([network_data_scaled, network_numeric_pca], axis=1)

I can see three clear clusters: two thin, elongated clusters in the bottom left corner and one bigger cluster spanning towards the top. These stand out here, but their relationship to each other is not clear from this projection alone.
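As a sanity check on how representative the 2-D scatter plot is, the explained variance ratio of the fitted PCA can be inspected. A minimal sketch below uses random synthetic data (an assumption; in the notebook you would look at `pca.explained_variance_ratio_` of the PCA fitted on `network_numeric_features_scaled`):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random 6-feature data standing in for the scaled network features
# (an assumption, only so this sketch runs on its own)
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(200, 6)))

pca = PCA().fit(X)
# Fraction of the total variance that the 2-D scatter plot actually shows
two_dim_share = pca.explained_variance_ratio_[:2].sum()
```

If `two_dim_share` is low, clusters that look overlapping in the scatter plot may still be well separated in the full 6-dimensional space.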

Because clustering is an unsupervised learning method, the class column is completely unnecessary for most of these tasks. You will only need the class column in Part 4, where it's used to compute a performance metric and to visually compare clustering results to the classes.

Part 2a: Agglomerative hierarchical clustering¶

  • Cluster the data into 4 clusters using agglomerative hierarchical clustering. Try different values for the "linkage" parameter.

  • Use the z-score standardized 6-dimensional data for clustering - not the principal components!

  • What is the significance of the linkage criterion in a hierarchical clustering algorithm?

  • Evaluate the clustering performance for each linkage criterion using a metric called "silhouette score".

  • What does silhouette score quantify and how is it computed?

In [8]:
#https://www.youtube.com/watch?v=RjyR1IRm9VA
link = ['ward', 'complete', 'average', 'single'] #linkage types
for l in link: 
    #initialize model
    clustering = AgglomerativeClustering(n_clusters=4, linkage=l)  # Euclidean distance is the default metric (the old 'affinity' parameter was renamed 'metric' in newer scikit-learn versions)
    #fit
    clustering.fit(network_numeric_features_scaled)
    #silhouette score
    membership = clustering.labels_
    print(l, ": ", silhouette_score(network_numeric_features_scaled, membership))
ward :  0.6299185426636938
complete :  0.48157598037192545
average :  0.6241907400676748
single :  0.7288757664397134

What is the significance of the linkage criterion in a hierarchical clustering algorithm?
https://statsandr.com/blog/files/Hierarchical-clustering-cheatsheet.pdf
The linkage criterion defines how the distance between two clusters is measured: single linkage uses the closest pair of points, complete linkage the farthest pair, average linkage the mean of all pairwise distances, and Ward's method the increase in within-cluster variance caused by the merge. Since the algorithm always merges the two closest clusters, the choice of criterion determines which merges happen and can produce very different hierarchies.
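To illustrate what the different linkage criteria measure, here is a tiny sketch with two hand-picked point sets (the coordinates are made up): single linkage takes the minimum pairwise distance between the clusters, complete the maximum, and average the mean.

```python
import numpy as np

# Two tiny clusters; the coordinates are made up just for illustration
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise Euclidean distances between points of A and points of B
d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

single = d.min()     # distance of the closest pair -> 3.0
complete = d.max()   # distance of the farthest pair -> 6.0
average = d.mean()   # mean over all pairs -> 4.5
```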

What does silhouette score quantify and how is it computed?
https://towardsdatascience.com/silhouette-coefficient-validating-clustering-techniques-e976bb81d10c#:~:text=Silhouette%20Coefficient%20or%20silhouette%20score%20is%20a%20metric%20used%20to,each%20other%20and%20clearly%20distinguished.
The silhouette score quantifies how similar each sample is to its own cluster compared to the nearest other cluster, i.e. whether the clusters are compact and clearly separated from each other. For each sample i, a(i) is the mean distance to the other points in its own cluster and b(i) is the mean distance to the points of the nearest other cluster; the sample's silhouette value is s(i) = (b(i) - a(i)) / max(a(i), b(i)). The overall score is the mean of s(i) over all samples and ranges from -1 to 1, with values near 1 indicating well-separated clusters.
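As a sketch of how the silhouette score is computed: for each sample, a is the mean distance to the other points of its own cluster and b the mean distance to the points of the nearest other cluster, and the score averages (b - a) / max(a, b) over all samples. A toy check against scikit-learn (the four 1-D points are made up for illustration):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Four 1-D points in two obvious clusters (hypothetical toy values)
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# Full pairwise distance matrix
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

scores = []
n = len(X)
for i in range(n):
    own = (labels == labels[i]) & (np.arange(n) != i)
    a = D[i, own].mean()                      # mean distance within own cluster
    b = min(D[i, labels == c].mean()          # mean distance to nearest other cluster
            for c in np.unique(labels) if c != labels[i])
    scores.append((b - a) / max(a, b))

manual = float(np.mean(scores))
# silhouette_score(X, labels) returns the same value
```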

Part 2b: Dendrograms¶

  • Plot dendrograms to visualize the merging processes.
  • For this you will need a linkage matrix. Hint: while you can extract one from a fitted AgglomerativeClustering object, it is much easier to use the scipy implementation (scipy.cluster.hierarchy.linkage).
  • Compute the linkage matrix using both average and complete linkage, and plot the dendrograms using scipy.cluster.hierarchy.dendrogram).
  • Truncate the dendrogram so that three levels of the dendrogram tree are visible for better readability.
  • How do you interpret the dendrograms? How do they differ?
In [9]:
x = 1
link = ['complete', 'average'] #linkage types
fig, axs = plt.subplots(1, 2, figsize=(16, 7))
for l in link: 
    linkage_data = linkage(network_numeric_features_scaled, l)
    plt.subplot(1,2,x)
    dendrogram(linkage_data, truncate_mode='level', p=3)
    x = x + 1 
    plt.title(l)

plt.show()

How do you interpret the dendrograms? How do they differ?
https://wheatoncollege.edu/wp-content/uploads/2012/08/How-to-Read-a-Dendrogram-Web-Ready.pdf
The lengths of the vertical lines tell how dissimilar the merged clusters are from each other. Complete linkage seems to do a better job at separating the clusters, as its merge heights are larger, at least when observing the third level. Average linkage seems to produce more distinct small clusters on the lower levels visible in the dendrograms.
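As a side note, the same linkage matrix that feeds dendrogram() can also be cut into a flat clustering with scipy's fcluster. A small sketch on synthetic data (the two toy blobs are an assumption, not the exercise data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated toy blobs (made up, standing in for the scaled features)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)),
               rng.normal(5, 0.3, size=(20, 2))])

# Cut the hierarchy into exactly 2 flat clusters
Z = linkage(X, method='complete')
flat = fcluster(Z, t=2, criterion='maxclust')  # labels start from 1
```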

Part 3: k-means clustering¶

  • Perform k-means clustering on the data. Use 4 clusters.
  • Evaluate the clustering performance using silhouette score.
  • Experiment with some other numbers of clusters. Does the data fit better into a different number of clusters according to silhouette score?
In [10]:
# This is done based on the TKO_7093 Statistical Data Analysis course's clustering example code
# create k-means model with four clusters
kmeans = KMeans(n_clusters=4).fit(network_numeric_features_scaled)
kmeans_labels = kmeans.labels_

print("kmeans silhouette score: ", silhouette_score(network_numeric_features_scaled, kmeans_labels))
kmeans silhouette score:  0.634040977469723
In [11]:
for x in range(2,50):
    kmeans = KMeans(n_clusters=x).fit(network_numeric_features_scaled)
    kmeans_labels = kmeans.labels_

    print(x ,": ", silhouette_score(network_numeric_features_scaled, kmeans_labels))
2 :  0.4247796965403217
3 :  0.5419542961632046
4 :  0.634040977469723
5 :  0.6581598820518922
6 :  0.7547927328720325
7 :  0.7727553051521152
8 :  0.7870033283213129
9 :  0.7686670126482058
10 :  0.748197467722064
11 :  0.7688513353507218
12 :  0.7563718309946389
13 :  0.7669658431130512
14 :  0.7860679698072327
15 :  0.7782966744801677
16 :  0.7861532672922883
17 :  0.8027367280714405
18 :  0.8011071216944013
19 :  0.802489199690086
20 :  0.8012209292782488
21 :  0.8004592398541462
22 :  0.8000588459512196
23 :  0.8018664437698471
24 :  0.8012506351117956
25 :  0.8034603698393512
26 :  0.8031815584520139
27 :  0.8089981677081105
28 :  0.810056665229098
29 :  0.805090013073449
30 :  0.8111623579893673
31 :  0.8099840544161045
32 :  0.8052821894291811
33 :  0.8020833340665822
34 :  0.8108935605496944
35 :  0.8185883852263796
36 :  0.793636784997507
37 :  0.8015330495521672
38 :  0.8001217550377581
39 :  0.8112948457888817
40 :  0.8105451076725799
41 :  0.8078645058836601
42 :  0.8114920403740162
43 :  0.8105834189033932
44 :  0.8125227591188027
45 :  0.8084332871923232
46 :  0.8083716682446531
47 :  0.806828492185299
48 :  0.8052740944777926
49 :  0.8114277635088372

Experiment with some other numbers of clusters. Does the data fit better into a different number of clusters according to silhouette score? The score rises from 0.63 at 4 clusters to about 0.79 by 8 clusters and then plateaus, staying roughly in the range 0.77-0.82 for larger cluster counts. According to silhouette score alone, the data would fit better into about 8 or more clusters than into 4, although the gains beyond that point are marginal.
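The scan above can be condensed into picking the k with the highest silhouette score directly. A sketch on synthetic blobs (make_blobs and the fixed random_state are assumptions; in the notebook the loop would run on network_numeric_features_scaled):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated blobs, standing in for the real features
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

ks = list(range(2, 10))
scores = [silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
          for k in ks]
best_k = ks[int(np.argmax(scores))]  # k with the highest silhouette score
```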

Rand score briefly described¶

Rand score is a measure of similarity between two partitions of a set of elements - in this case true classes and clusters found by the clustering algorithm - and it is one of the most frequently used performance metrics for clustering. It is computed by considering each pair of elements in the dataset and counting pairs of elements as follows:

     a: number of pairs such that the elements are in the same class and in the same cluster
     b: number of pairs such that the elements are in different classes and in different clusters
     c: number of pairs such that the elements are in the same class but in different clusters
     d: number of pairs such that the elements are in different classes but in the same cluster

 Given a, b, c, d, the formula for rand index is:

     rand_index = (a+b)/(a+b+c+d).

"Adjusted Rand index" is corrected for chance by using maximum and expected values of Rand index.

    adj_rand_index = (rand_index - expected_rand_index) / (max_rand_index - expected_rand_index)
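The pair counting described above can be written out directly and checked against scikit-learn's rand_score and adjusted_rand_score (the two label vectors below are made-up toy examples):

```python
from itertools import combinations
from sklearn.metrics import rand_score, adjusted_rand_score

# Tiny hypothetical labelings, only to illustrate the pair counting
true = [0, 0, 1, 1, 2]
pred = [0, 0, 1, 2, 2]

a = b = c = d = 0
for i, j in combinations(range(len(true)), 2):
    same_class = true[i] == true[j]
    same_cluster = pred[i] == pred[j]
    if same_class and same_cluster:
        a += 1   # together in both partitions
    elif not same_class and not same_cluster:
        b += 1   # apart in both partitions
    elif same_class:
        c += 1   # same class, split across clusters
    else:
        d += 1   # different classes, same cluster

rand_index = (a + b) / (a + b + c + d)
# sklearn's rand_score gives the same value; adjusted_rand_score corrects it for chance
```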

Part 4a: Compare the clusters with the true labels (hierarchical clustering)¶

  • Cluster the data into 4 clusters using agglomerative hierarchical clustering.
  • Choose the linkage criterion that had the best silhouette score performance in Part 2a.
  • Visualize the data again using PCA, this time coloring the scatter plot based on the true class labels. Visually compare the two scatter plots: how well do the clusters found by the clustering algorithm match the true classes? Place the two scatter plots so that they can easily be compared (e.g. in subplots next to each other in the same figure).
  • For an objective evaluation of the clustering, compute the adjusted Rand score (use the scikit-learn implementation) using the true labels and the labels predicted by clustering algorithm. How do you interpret the result?
  • If the results seem unimpressive, don't get discouraged - clustering "real life" data sets to match classes is a difficult task, and a low Rand score does not necessarily mean that you have made a mistake.
In [12]:
clustering = AgglomerativeClustering(n_clusters=4, linkage='single')  # Euclidean distance is the default metric
clustering.fit_predict(network_numeric_features_scaled)
labels = pd.DataFrame(clustering.labels_)
labels.columns = ['AggLabels']
In [13]:
# This is done based on the TKO_7093 Statistical Data Analysis course's PCA example code
# quantitative features (standardized)
network_numeric_features_scaled

# create PCA model
pca = PCA().fit(network_numeric_features_scaled)

# transform data into new space
network_numeric_pca = pd.DataFrame(pca.transform(network_numeric_features_scaled))

# add transformed data back to original data frame
network_data_scaled_extended = pd.concat([network_data_scaled, network_numeric_pca], axis=1)
network_data_scaled_extended['class'] = pd.Categorical(network_data_scaled_extended['class'])
In [14]:
# agglomerative clustering labels joined with the PCA components
network_data_scaled_ext = pd.concat([labels, network_numeric_pca], axis=1)
network_data_scaled_ext['AggLabels'] = pd.Categorical(network_data_scaled_ext['AggLabels'])
network_data_scaled_ext.head()
Out[14]:
AggLabels 0 1 2 3 4 5
0 0 -0.784767 -2.372303 3.040695 -0.741587 1.218450 -0.010308
1 0 2.056440 -1.222154 -0.698612 0.176703 0.232087 -0.189304
2 0 1.306228 -1.295604 -0.535884 0.155114 -0.036722 -0.928571
3 0 2.053927 -1.219385 -0.697587 0.176456 0.230843 -0.186342
4 0 1.147854 -1.121159 -0.471296 0.139573 -0.115129 -0.742002
In [15]:
import plotly.express as px
# plot PCA with actual class colors
fig = px.scatter(network_data_scaled_extended, x=0, y=1, color=network_data_scaled_extended['class'])
fig.show()
In [16]:
# plot pca with agglomerative clustering labels
fig = px.scatter(network_data_scaled_ext, x=0, y=1, color=network_data_scaled_ext['AggLabels'])
fig.show()
In [17]:
# adjusted Rand score
adjusted_rand_score(network_data_scaled_extended['class'], network_data_scaled_ext['AggLabels'])
Out[17]:
0.00011807828578910034

Based on the visual comparison and the adjusted Rand score, the labeling is poor: the two scatter plots show completely different clusterings, and the score is close to 0, which is what a random labeling would be expected to get.

Part 4b: Compare the clusters with true labels (k-means clustering)¶

  • Repeat the above steps, but this time using k-means clustering instead of hierarchical clustering.
  • Which performs better according to the adjusted Rand score?
In [18]:
# This is done based on the TKO_7093 Statistical Data Analysis course's clustering example code
# create k-means model with four clusters
kmeans = KMeans(n_clusters=4).fit(network_numeric_features_scaled)

# get memberships of data points in clusters
predictions = kmeans.predict(network_numeric_features_scaled)

# add memberships back to original data frame
network_data_scaled_copy = network_data_scaled_extended.copy()
network_data_scaled_copy['predictions'] = pd.Categorical(predictions)
network_data_scaled_copy.head()
Out[18]:
src_bytes dst_bytes duration count serror_rate rerror_rate class 0 1 2 3 4 5 predictions
0 -0.865773 -0.444017 -0.08373 0.563862 4.011105 -0.404417 denial_of_service -0.784767 -2.372303 3.040695 -0.741587 1.218450 -0.010308 3
1 1.355016 -0.444017 -0.08373 2.015840 -0.253374 -0.404417 denial_of_service 2.056440 -1.222154 -0.698612 0.176703 0.232087 -0.189304 0
2 0.253229 -0.444017 -0.08373 2.015840 -0.253374 -0.404417 denial_of_service 1.306228 -1.295604 -0.535884 0.155114 -0.036722 -0.928571 0
3 1.355016 -0.444017 -0.08373 2.010798 -0.253374 -0.404417 denial_of_service 2.053927 -1.219385 -0.697587 0.176456 0.230843 -0.186342 0
4 0.253229 -0.444017 -0.08373 1.698219 -0.253374 -0.404417 denial_of_service 1.147854 -1.121159 -0.471296 0.139573 -0.115129 -0.742002 0
In [19]:
# plot PCA with actual class colors
fig = px.scatter(network_data_scaled_extended, x=0, y=1, color=network_data_scaled_extended['class'])
fig.show()
In [20]:
# plot PCA with k-means cluster labels
fig = px.scatter(network_data_scaled_copy, x=0, y=1, color=network_data_scaled_copy['predictions'])
fig.show()
In [21]:
# adjusted Rand score
adjusted_rand_score(network_data_scaled_extended['class'], network_data_scaled_copy['predictions'])
Out[21]:
0.2951546107072655

Which performs better according to the adjusted Rand score?
k-means was much better at clustering than agglomerative clustering. This can be seen from the visualizations as well as from the adjusted Rand score. Still, the Rand score is not particularly good.

Part 5 (optional BONUS task): Clustering unlabeled data¶

In this task, you are working with data where the classes are not available, given as ex3_seeds_data_BONUS.csv on Moodle. The original data set is available for those who are interested, but use the slightly modified data on Moodle instead.

In general this is a very challenging and open-ended type of task that requires in-depth domain knowledge for meaningful results. Note, however, that in this exercise you are not required to research the domain in question (e.g. properties of different varieties of wheat). You might need to search for more information related to clustering in order to complete this exercise.

Some of the questions are open-ended and have no correct answer. It's enough to clearly show that you thought about the questions.

  • As in Part 1, z-score standardize the data, project it to 2 dimensions using PCA and visualize the result in a scatter plot.
  • Does the scatter plot look like the data might have a clustered structure? How many clusters do you see?
  • Decide, based on what you've learned about silhouette score and Rand score, which performance metric you should use in this task. Justify your choice.
  • Get an objective evaluation of how many clusters the data most likely has by using your chosen performance metric. Try both k-means clustering and agglomerative hierarchical clustering with different linkage criterions and see which performs best.
  • Visualize (with color) the best-performing result in the PCA scatter plot you created earlier.
In [ ]: